Simulations


I simulate GWAS \(p\)-values for 80,356 SNPs on chromosome 22 (with \(MAF>0.05\)) including 14,234 independent SNPs. For each LD block (24 of these), I use simGWAS to simulate results from 100 GWAS’ (varying the shared CVs).

The example shown below has 52 CVs:

I further define a set of “truly associated” as CVs + those SNPs with \(r^2>0.8\) with ANY of the causals (1227 of these in this example) and a set of “truly not-associated” as a set of SNPs with \(r^2<0.2\) with ALL the causals (75,552 of these in this example). I.e. the vast majority of the SNPs are “truly not-associated”.


Quantification

  • What proportion of significant calls are truly associated?
length(which(x$assoc==1 & x$p<=5*10^-8))/length(which(x$p<=5*10^-8))
## [1] 0.8534031
  • What proportion of significant calls are truly not-associated?
length(which(x$not_assoc==1 & x$p<=5*10^-8))/length(which(x$p<=5*10^-8))
## [1] 0.005235602
  • Power (sensitivity) proxy: What proportion of truly associated are called significant?
length(which(x$assoc==1 & x$p<=5*10^-8))/length(which(x$assoc==1))
## [1] 0.398533
  • Specificity proxy: What proportion of non-significants are truly not associated?
length(which(x$not_assoc==1 & x$p>5*10^-8))/length(which(x$p>5*10^-8))
## [1] 0.9469311

1. Independent uniform


I compare results with empirical cFDR for independent uniform q (no left-censoring or removal of L-curve segments).

Plotting v5 (v-vals after 5 iterations over independent uniform q) against the original p-values:

Based on my 110 simulations (11 simulations per array job take 24 hrs; can also present results as a table):

I’ll make a similar plot for specificity when my simulations have finished.


3. Functional simulations


I now define:

  • CVs as “causals”
  • CVs +/- SNPs 10,000 bp as “functionals”
  • Nulls as non-functionals

I sample functionals from distribution A (rnorm(n, mean = sample(c(-2,-3),1), sd = sample(c(0.5,0.2), 1))) and nulls from distribution B (rnorm(n, mean = sample(c(2,3,4), 1), sample(c(1, 1.5), 1))).

But I need to think about how I can iterate so I am not violating the independence assumption. Perhaps:

  • Iteration 1: “Functionals” from some dist X (e.g. open chromatin functional data)
  • Iteration 2: “CVs” from some dist Y (e.g. TFBS functional data)
  • Iteration 3: “CVs + 100bp” from some dist Z (e.g. chip-seq data)

but this may still be violating the independence assumption..


4. Probabalistic functional simulations


As above but sample nulls from distribution A with probability 0.9 and functionals from distribution A with probability 0.1.


Asthma data dimensionality reduction


NNMF


I have ran non-negative matrix factorisation on the annotation data using the NNLM package (nmf is too slow for matrices of this size). [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6945623/]

res_out <- nnmf(A = as.matrix(annots), k = 50, method = "scd", verbose = 1)

I regress the \(log(p)\) values against the NMF co-ordinates and choose dimensions to iterate over based on the t-values.

Dimension 23 is picked out:

regression_df <- data.frame(chisq = log(p_df$european_ancestry_pval_rand)[ind], coords[ind,])

lm_out <- lm(chisq ~ ., data = regression_df)
summary(lm_out)
## 
## Call:
## lm(formula = chisq ~ ., data = regression_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.512  -0.372   0.326   0.733   1.769 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.975e-01  9.993e-03 -89.810  < 2e-16 ***
## X1           4.265e-04  2.618e-03   0.163 0.870560    
## X2           8.269e-04  9.294e-04   0.890 0.373600    
## X3          -1.681e-03  1.092e-03  -1.538 0.123958    
## X4          -2.056e-03  2.271e-03  -0.905 0.365307    
## X5          -9.684e-03  1.807e-03  -5.358 8.42e-08 ***
## X6          -8.175e-03  5.916e-03  -1.382 0.167018    
## X7           3.196e-03  3.293e-03   0.971 0.331779    
## X8          -3.585e-02  4.211e-03  -8.513  < 2e-16 ***
## X9          -6.462e-02  7.093e-03  -9.110  < 2e-16 ***
## X10          2.286e-03  1.690e-03   1.352 0.176260    
## X11         -1.305e-03  1.396e-03  -0.935 0.349956    
## X12         -6.764e-04  3.519e-04  -1.922 0.054600 .  
## X13         -1.548e-01  1.315e-02 -11.766  < 2e-16 ***
## X14         -2.868e-03  2.546e-03  -1.127 0.259845    
## X15          3.266e-03  1.623e-03   2.013 0.044159 *  
## X16         -1.049e-01  1.448e-02  -7.247 4.25e-13 ***
## X17         -1.139e-03  2.014e-03  -0.566 0.571567    
## X18         -3.271e-02  7.007e-03  -4.668 3.05e-06 ***
## X19          2.390e-03  1.940e-03   1.232 0.217969    
## X20         -2.056e-02  1.940e-03 -10.593  < 2e-16 ***
## X21         -2.559e-03  2.970e-03  -0.862 0.388840    
## X22         -3.953e-04  1.248e-03  -0.317 0.751362    
## X23         -6.048e-03  3.640e-04 -16.617  < 2e-16 ***
## X24         -4.153e-03  1.755e-03  -2.366 0.017976 *  
## X25         -1.695e-02  3.347e-03  -5.065 4.08e-07 ***
## X26         -3.790e-02  3.882e-03  -9.764  < 2e-16 ***
## X27          2.022e-03  1.674e-03   1.208 0.227159    
## X28         -9.915e-03  4.140e-03  -2.395 0.016612 *  
## X29         -3.257e-02  3.230e-03 -10.083  < 2e-16 ***
## X30         -3.268e-04  6.792e-04  -0.481 0.630460    
## X31         -2.700e-03  1.700e-03  -1.588 0.112303    
## X32         -5.452e-02  4.334e-03 -12.581  < 2e-16 ***
## X33         -9.969e-03  2.476e-03  -4.026 5.67e-05 ***
## X34         -5.771e-02  4.989e-03 -11.568  < 2e-16 ***
## X35         -7.829e-03  2.064e-03  -3.793 0.000149 ***
## X36         -2.944e-02  2.500e-03 -11.777  < 2e-16 ***
## X37          1.455e-03  9.734e-05  14.943  < 2e-16 ***
## X38          7.593e-03  4.473e-03   1.697 0.089606 .  
## X39         -2.389e-03  1.535e-03  -1.556 0.119700    
## X40         -1.702e-04  6.289e-04  -0.271 0.786735    
## X41         -3.483e-03  2.049e-03  -1.700 0.089221 .  
## X42          7.219e-03  2.731e-03   2.643 0.008217 ** 
## X43         -3.081e-03  1.156e-03  -2.665 0.007704 ** 
## X44          5.589e-03  2.762e-03   2.023 0.043050 *  
## X45         -2.681e-03  1.155e-03  -2.321 0.020287 *  
## X46          1.210e-02  3.082e-03   3.927 8.60e-05 ***
## X47          4.787e-03  2.886e-03   1.659 0.097126 .  
## X48          4.019e-03  3.400e-03   1.182 0.237196    
## X49          5.680e-03  1.861e-03   3.051 0.002278 ** 
## X50         -2.262e-02  2.222e-03 -10.178  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.12 on 513061 degrees of freedom
## Multiple R-squared:  0.002451,   Adjusted R-squared:  0.002354 
## F-statistic: 25.21 on 50 and 513061 DF,  p-value: < 2.2e-16
t_values <- summary(lm_out)$coefficients[,3]
ordered_dims  <- sort(abs(t_values), decreasing = TRUE)
ordered_dims
## (Intercept)         X23         X37         X32         X36         X13 
##  89.8102720  16.6166604  14.9426067  12.5805594  11.7768096  11.7658887 
##         X34         X20         X50         X29         X26          X9 
##  11.5677220  10.5931708  10.1777566  10.0828724   9.7641775   9.1101797 
##          X8         X16          X5         X25         X18         X33 
##   8.5130299   7.2474122   5.3580564   5.0653892   4.6677431   4.0261768 
##         X46         X35         X49         X43         X42         X28 
##   3.9270036   3.7930278   3.0514003   2.6647892   2.6430475   2.3951889 
##         X24         X45         X44         X15         X12         X41 
##   2.3661286   2.3210084   2.0232305   2.0125844   1.9220476   1.6995265 
##         X38         X47         X31         X39          X3          X6 
##   1.6974834   1.6589515   1.5879284   1.5560350   1.5383735   1.3818517 
##         X10         X19         X27         X48         X14          X7 
##   1.3523616   1.2319484   1.2077098   1.1820246   1.1267586   0.9705385 
##         X11          X4          X2         X21         X17         X30 
##   0.9346764   0.9052997   0.8897520   0.8617245   0.5657452   0.4810802 
##         X22         X40          X1 
##   0.3168444   0.2705533   0.1629472

But this only has a correlation of -0.005 with p…


Non-negative PCA


I regress the \(log(p)\) values against the co-ordinates and choose dimensions to iterate over based on the t-values.

regression_df <- data.frame(chisq = log(p)[ind], coords[ind,])

lm_out <- lm(chisq ~ ., data = regression_df)
summary(lm_out)
## 
## Call:
## lm(formula = chisq ~ ., data = regression_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.487  -0.372   0.326   0.733   1.976 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -1.042e+00  1.916e-03 -543.968  < 2e-16 ***
## PC1          5.039e-03  6.052e-04    8.325  < 2e-16 ***
## PC2         -2.367e-02  1.270e-03  -18.637  < 2e-16 ***
## PC3          2.496e-03  1.721e-03    1.450 0.146959    
## PC4         -9.077e-03  3.198e-03   -2.838 0.004538 ** 
## PC5          3.654e-02  8.196e-03    4.458 8.28e-06 ***
## PC6          3.603e-02  2.442e-03   14.756  < 2e-16 ***
## PC7          2.811e-02  6.427e-03    4.373 1.22e-05 ***
## PC8         -1.866e-02  3.232e-03   -5.772 7.83e-09 ***
## PC9          3.373e-02  4.844e-03    6.963 3.33e-12 ***
## PC10         2.083e-02  4.393e-03    4.741 2.13e-06 ***
## PC11        -3.833e-02  3.957e-03   -9.688  < 2e-16 ***
## PC12         4.832e-02  6.092e-03    7.932 2.16e-15 ***
## PC13        -7.389e-04  3.941e-03   -0.187 0.851270    
## PC14         2.695e-04  5.349e-03    0.050 0.959818    
## PC15        -2.243e-02  5.228e-03   -4.290 1.79e-05 ***
## PC16         5.787e-03  5.476e-03    1.057 0.290646    
## PC17         3.211e-02  6.519e-03    4.925 8.44e-07 ***
## PC18        -1.150e-02  4.828e-03   -2.383 0.017182 *  
## PC19         2.613e-02  5.730e-03    4.559 5.13e-06 ***
## PC20         2.474e-02  6.633e-03    3.730 0.000191 ***
## PC21         1.025e-03  5.579e-03    0.184 0.854191    
## PC22        -8.892e-03  5.643e-03   -1.576 0.115042    
## PC23        -2.849e-02  6.272e-03   -4.542 5.58e-06 ***
## PC24        -3.183e-02  6.551e-03   -4.858 1.18e-06 ***
## PC25        -1.628e-02  6.246e-03   -2.606 0.009151 ** 
## PC26        -4.215e-02  6.493e-03   -6.492 8.47e-11 ***
## PC27        -3.173e-02  6.270e-03   -5.061 4.18e-07 ***
## PC28         1.554e-03  6.009e-03    0.259 0.795969    
## PC29        -2.021e-03  6.175e-03   -0.327 0.743473    
## PC30        -5.353e-03  6.919e-03   -0.774 0.439111    
## PC31        -1.052e-02  5.911e-03   -1.780 0.075129 .  
## PC32        -2.631e-02  6.459e-03   -4.073 4.63e-05 ***
## PC33         2.226e-02  6.345e-03    3.509 0.000450 ***
## PC34        -2.889e-02  6.815e-03   -4.239 2.25e-05 ***
## PC35        -7.135e-05  6.931e-03   -0.010 0.991788    
## PC36         1.859e-02  6.446e-03    2.884 0.003930 ** 
## PC37        -5.004e-02  7.159e-03   -6.990 2.74e-12 ***
## PC38        -5.470e-02  7.197e-03   -7.600 2.97e-14 ***
## PC39         1.220e-02  7.429e-03    1.642 0.100536    
## PC40        -3.146e-03  6.192e-03   -0.508 0.611398    
## PC41        -6.560e-03  7.207e-03   -0.910 0.362646    
## PC42         1.944e-03  6.832e-03    0.285 0.775947    
## PC43        -1.526e-02  7.884e-03   -1.935 0.052952 .  
## PC44         1.650e-03  9.048e-03    0.182 0.855316    
## PC45        -3.833e-02  7.869e-03   -4.871 1.11e-06 ***
## PC46         1.606e-02  8.370e-03    1.919 0.055021 .  
## PC47        -7.412e-05  7.625e-03   -0.010 0.992244    
## PC48        -2.617e-02  8.272e-03   -3.164 0.001558 ** 
## PC49        -6.774e-02  1.017e-02   -6.661 2.72e-11 ***
## PC50        -3.344e-02  1.068e-02   -3.132 0.001737 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.119 on 513061 degrees of freedom
## Multiple R-squared:  0.002908,   Adjusted R-squared:  0.002811 
## F-statistic: 29.93 on 50 and 513061 DF,  p-value: < 2.2e-16
t_values <- summary(lm_out)$coefficients[,3]
ordered_dims  <- sort(abs(t_values), decreasing = TRUE)
ordered_dims
##  (Intercept)          PC2          PC6         PC11          PC1         PC12 
## 543.96825951  18.63734441  14.75601531   9.68774084   8.32505501   7.93195145 
##         PC38         PC37          PC9         PC49         PC26          PC8 
##   7.59991594   6.99044956   6.96322977   6.66139593   6.49220150   5.77226133 
##         PC27         PC17         PC45         PC24         PC10         PC19 
##   5.06054861   4.92493341   4.87126841   4.85847391   4.74114729   4.55933143 
##         PC23          PC5          PC7         PC15         PC34         PC32 
##   4.54189835   4.45787240   4.37339085   4.29022905   4.23860318   4.07344260 
##         PC20         PC33         PC48         PC50         PC36          PC4 
##   3.73022971   3.50861524   3.16358446   3.13183006   2.88369979   2.83812933 
##         PC25         PC18         PC43         PC46         PC31         PC39 
##   2.60638907   2.38279482   1.93531668   1.91871324   1.77967729   1.64226599 
##         PC22          PC3         PC16         PC41         PC30         PC40 
##   1.57593214   1.45036000   1.05670698   0.91033486   0.77369530   0.50807895 
##         PC29         PC42         PC28         PC13         PC21         PC44 
##   0.32725757   0.28460505   0.25856747   0.18749789   0.18377326   0.18233942 
##         PC14         PC35         PC47 
##   0.05038217   0.01029303   0.00972038

Principal component 2 is picked out first:

But this only has a correlation of -0.004 with p…


PCAmix


regression_df <- data.frame(chisq = log(p_df$european_ancestry_pval_rand)[ind], coords[ind,1:50])

lm_out <- lm(chisq ~ ., data = regression_df)
summary(lm_out)
## 
## Call:
## lm(formula = chisq ~ ., data = regression_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.463  -0.372   0.326   0.733   1.322 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -1.041e+00  1.844e-03 -564.544  < 2e-16 ***
## dim.1       -7.927e-03  5.299e-04  -14.960  < 2e-16 ***
## dim.2        1.290e-04  7.392e-04    0.175 0.861448    
## dim.3       -2.216e-03  8.290e-04   -2.673 0.007511 ** 
## dim.4        9.927e-05  8.789e-04    0.113 0.910065    
## dim.5       -8.606e-03  8.646e-04   -9.954  < 2e-16 ***
## dim.6        9.369e-03  9.980e-04    9.388  < 2e-16 ***
## dim.7        9.483e-04  1.019e-03    0.930 0.352139    
## dim.8       -1.024e-02  1.021e-03  -10.023  < 2e-16 ***
## dim.9       -7.218e-03  1.207e-03   -5.979 2.24e-09 ***
## dim.10       6.982e-03  1.110e-03    6.292 3.13e-10 ***
## dim.11      -1.682e-02  1.199e-03  -14.031  < 2e-16 ***
## dim.12       5.467e-03  1.183e-03    4.622 3.80e-06 ***
## dim.13       3.195e-03  1.202e-03    2.659 0.007836 ** 
## dim.14      -1.444e-02  1.217e-03  -11.867  < 2e-16 ***
## dim.15      -8.744e-03  1.277e-03   -6.848 7.47e-12 ***
## dim.16      -9.360e-04  1.354e-03   -0.691 0.489256    
## dim.17       1.245e-03  1.378e-03    0.903 0.366280    
## dim.18      -4.661e-04  1.356e-03   -0.344 0.730982    
## dim.19       8.045e-04  1.430e-03    0.563 0.573774    
## dim.20      -1.064e-03  1.685e-03   -0.631 0.527771    
## dim.21       3.224e-03  1.636e-03    1.971 0.048708 *  
## dim.22       2.503e-03  1.562e-03    1.602 0.109124    
## dim.23       7.188e-06  1.597e-03    0.005 0.996407    
## dim.24      -2.258e-03  1.540e-03   -1.466 0.142555    
## dim.25      -2.710e-04  1.493e-03   -0.182 0.855951    
## dim.26      -1.455e-03  1.449e-03   -1.004 0.315244    
## dim.27       5.152e-04  1.395e-03    0.369 0.711793    
## dim.28      -7.076e-03  1.354e-03   -5.227 1.72e-07 ***
## dim.29      -5.331e-04  1.340e-03   -0.398 0.690640    
## dim.30       3.608e-03  1.442e-03    2.503 0.012323 *  
## dim.31       5.673e-03  1.425e-03    3.980 6.89e-05 ***
## dim.32       4.833e-03  1.432e-03    3.375 0.000739 ***
## dim.33      -5.546e-03  1.435e-03   -3.863 0.000112 ***
## dim.34       1.047e-03  1.431e-03    0.731 0.464492    
## dim.35       3.089e-03  1.441e-03    2.143 0.032106 *  
## dim.36      -4.905e-04  1.454e-03   -0.337 0.735847    
## dim.37       4.267e-04  1.437e-03    0.297 0.766443    
## dim.38       2.691e-03  1.434e-03    1.876 0.060692 .  
## dim.39      -8.609e-04  1.420e-03   -0.606 0.544350    
## dim.40      -3.962e-03  1.415e-03   -2.800 0.005104 ** 
## dim.41       1.717e-03  1.423e-03    1.207 0.227563    
## dim.42       4.741e-04  1.478e-03    0.321 0.748352    
## dim.43      -2.994e-03  1.517e-03   -1.974 0.048436 *  
## dim.44       1.459e-03  1.535e-03    0.950 0.341897    
## dim.45       3.016e-03  1.525e-03    1.978 0.047961 *  
## dim.46      -1.168e-03  1.545e-03   -0.756 0.449377    
## dim.47       2.013e-03  1.557e-03    1.293 0.196058    
## dim.48      -1.057e-03  1.553e-03   -0.681 0.496079    
## dim.49      -1.792e-03  1.559e-03   -1.150 0.250324    
## dim.50      -8.265e-03  1.510e-03   -5.473 4.43e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.12 on 513061 degrees of freedom
## Multiple R-squared:  0.002202,   Adjusted R-squared:  0.002105 
## F-statistic: 22.64 on 50 and 513061 DF,  p-value: < 2.2e-16
# pick dimension with biggest absolute t statistic
t_values <- summary(lm_out)$coefficients[,3]
ordered_dims  <- sort(abs(t_values), decreasing = TRUE)
ordered_dims
##  (Intercept)        dim.1       dim.11       dim.14        dim.8        dim.5 
## 5.645443e+02 1.495963e+01 1.403073e+01 1.186671e+01 1.002277e+01 9.954240e+00 
##        dim.6       dim.15       dim.10        dim.9       dim.50       dim.28 
## 9.388487e+00 6.848461e+00 6.292132e+00 5.979472e+00 5.473031e+00 5.227158e+00 
##       dim.12       dim.31       dim.33       dim.32       dim.40        dim.3 
## 4.622307e+00 3.980195e+00 3.863477e+00 3.374887e+00 2.800387e+00 2.673290e+00 
##       dim.13       dim.30       dim.35       dim.45       dim.43       dim.21 
## 2.659058e+00 2.502761e+00 2.143097e+00 1.977717e+00 1.973531e+00 1.971139e+00 
##       dim.38       dim.22       dim.24       dim.47       dim.41       dim.49 
## 1.875740e+00 1.602144e+00 1.466345e+00 1.292867e+00 1.206662e+00 1.149563e+00 
##       dim.26       dim.44        dim.7       dim.17       dim.46       dim.34 
## 1.004281e+00 9.504247e-01 9.304491e-01 9.034641e-01 7.564546e-01 7.314712e-01 
##       dim.16       dim.48       dim.20       dim.39       dim.19       dim.29 
## 6.914926e-01 6.806721e-01 6.314134e-01 6.062492e-01 5.625025e-01 3.979865e-01 
##       dim.27       dim.18       dim.36       dim.42       dim.37       dim.25 
## 3.694496e-01 3.438194e-01 3.373585e-01 3.208131e-01 2.970309e-01 1.815306e-01 
##        dim.2        dim.4       dim.23 
## 1.745319e-01 1.129572e-01 4.502561e-03

This looks to be the most promising:


But then when we do the next regression to pick out the next most significant dimension, there is a non-monotonic relationship (and we know that none of the cFDR methods can simultaneously decrease \(v\)-values for low and high \(q\)).

## 
## Call:
## lm(formula = chisq ~ ., data = regression_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.349  -0.375   0.328   0.733   1.501 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -1.0481724  0.0018385 -570.137  < 2e-16 ***
## dim.2       -0.0022101  0.0007426   -2.976 0.002921 ** 
## dim.3        0.0097650  0.0008305   11.758  < 2e-16 ***
## dim.4        0.0064289  0.0008829    7.281 3.31e-13 ***
## dim.5       -0.0227496  0.0008670  -26.240  < 2e-16 ***
## dim.6        0.0024284  0.0010021    2.423 0.015375 *  
## dim.7       -0.0051504  0.0010232   -5.033 4.82e-07 ***
## dim.8       -0.0174233  0.0010261  -16.980  < 2e-16 ***
## dim.9       -0.0161535  0.0012117  -13.331  < 2e-16 ***
## dim.10      -0.0050019  0.0011137   -4.491 7.08e-06 ***
## dim.11      -0.0126495  0.0012045  -10.502  < 2e-16 ***
## dim.12       0.0035096  0.0011884    2.953 0.003145 ** 
## dim.13       0.0052779  0.0012071    4.372 1.23e-05 ***
## dim.14      -0.0254585  0.0012211  -20.850  < 2e-16 ***
## dim.15      -0.0154245  0.0012822  -12.030  < 2e-16 ***
## dim.16      -0.0045766  0.0013600   -3.365 0.000765 ***
## dim.17       0.0025280  0.0013848    1.826 0.067910 .  
## dim.18       0.0018608  0.0013619    1.366 0.171855    
## dim.19       0.0035365  0.0014370    2.461 0.013852 *  
## dim.20       0.0003097  0.0016934    0.183 0.854897    
## dim.21       0.0048991  0.0016435    2.981 0.002874 ** 
## dim.22       0.0053037  0.0015697    3.379 0.000728 ***
## dim.23       0.0004700  0.0016042    0.293 0.769522    
## dim.24      -0.0012644  0.0015472   -0.817 0.413805    
## dim.25      -0.0008798  0.0015002   -0.586 0.557566    
## dim.26      -0.0021937  0.0014556   -1.507 0.131778    
## dim.27      -0.0010790  0.0014013   -0.770 0.441316    
## dim.28      -0.0065684  0.0013603   -4.829 1.37e-06 ***
## dim.29      -0.0017229  0.0013460   -1.280 0.200549    
## dim.30       0.0067465  0.0014484    4.658 3.20e-06 ***
## dim.31       0.0062229  0.0014323    4.345 1.39e-05 ***
## dim.32       0.0021651  0.0014389    1.505 0.132415    
## dim.33      -0.0088083  0.0014423   -6.107 1.02e-09 ***
## dim.34       0.0059362  0.0014376    4.129 3.64e-05 ***
## dim.35       0.0028960  0.0014482    2.000 0.045525 *  
## dim.36      -0.0025305  0.0014610   -1.732 0.083263 .  
## dim.37      -0.0018640  0.0014435   -1.291 0.196603    
## dim.38       0.0034859  0.0014413    2.418 0.015586 *  
## dim.39      -0.0061993  0.0014267   -4.345 1.39e-05 ***
## dim.40      -0.0026108  0.0014218   -1.836 0.066313 .  
## dim.41       0.0033790  0.0014298    2.363 0.018116 *  
## dim.42      -0.0002966  0.0014851   -0.200 0.841712    
## dim.43      -0.0080105  0.0015243   -5.255 1.48e-07 ***
## dim.44      -0.0071324  0.0015414   -4.627 3.71e-06 ***
## dim.45       0.0090286  0.0015321    5.893 3.80e-09 ***
## dim.46      -0.0022935  0.0015521   -1.478 0.139493    
## dim.47       0.0030492  0.0015647    1.949 0.051329 .  
## dim.48       0.0005422  0.0015602    0.348 0.728189    
## dim.49      -0.0025720  0.0015666   -1.642 0.100639    
## dim.50      -0.0110954  0.0015173   -7.312 2.63e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.125 on 513062 degrees of freedom
## Multiple R-squared:  0.004647,   Adjusted R-squared:  0.004552 
## F-statistic: 48.89 on 49 and 513062 DF,  p-value: < 2.2e-16
## (Intercept)       dim.5      dim.14       dim.8       dim.9      dim.15 
## 570.1366779  26.2395940  20.8495245  16.9804253  13.3311349  12.0295178 
##       dim.3      dim.11      dim.50       dim.4      dim.33      dim.45 
##  11.7579051  10.5020654   7.3124673   7.2814326   6.1069464   5.8927547 
##      dim.43       dim.7      dim.28      dim.30      dim.44      dim.10 
##   5.2551898   5.0333674   4.8286974   4.6577501   4.6271788   4.4913165 
##      dim.13      dim.39      dim.31      dim.34      dim.22      dim.16 
##   4.3723686   4.3453439   4.3447590   4.1291304   3.3788782   3.3652654 
##      dim.21       dim.2      dim.12      dim.19       dim.6      dim.38 
##   2.9809244   2.9759797   2.9532475   2.4610764   2.4234276   2.4184841 
##      dim.41      dim.35      dim.47      dim.40      dim.17      dim.36 
##   2.3632491   1.9997722   1.9487252   1.8363028   1.8256055   1.7320637 
##      dim.49      dim.26      dim.32      dim.46      dim.18      dim.37 
##   1.6417677   1.5071283   1.5046487   1.4776837   1.3662705   1.2912927 
##      dim.29      dim.24      dim.27      dim.25      dim.48      dim.23 
##   1.2799920   0.8172175   0.7699732   0.5864616   0.3475355   0.2930006 
##      dim.42      dim.20 
##   0.1997036   0.1828739


GenoWAP


Method

We downloaded GenoCanyon scores for the 1,978,016 SNPs in the Asthma data set, which represent the union of functional elements in different cell types. The GenoWAP software requires a `threshold’ parameter defining functional loci according to the functional score, for which we used the default recommended value of 0.1, which corresponds to 40% of the SNPs in our data set being “functional”. We then ran the GenoWAP python script to obtain posterior scores for each SNP.

We also ran functional cFDR using the GenoCanyon scores as the auxiliary functional data.


Results

Of the 1,978,016 SNPs in the smaller Asthma GWAS data set (our “principal trait”), 1,971,252 were also found in the larger UKBB Asthma data set.

For the 4269 SNPs that reached genome-wide significance in the larger UKBB GWAS, we compared the rank of the posterior scores from GenoWAP and the v-values from functional cFDR with the rank of the original \(p\)-value in the smaller Asthma GWAS data set. When using GenoWAP, 60.6% (2590/4269) had an improved or equal rank compared to 48.3% when using functional cFDR (2063/4269).

The venn diagram below shows the number of UKBB significant SNPs that recieved an improved or equal rank after applying each of the two methods (the white area outside should be 1403).


Similarly, of the 1,966,983 SNPs that did not reach genome-wide significance in the larger UKBB GWAS, 60.6% had a decreased or equal rank after applying GenoWAP compared to 30.6% when using functional cFDR. The cross over of these is shown in the venn diagram below.


Whilst the pattern of the change in ranking is similar after applying the two methods, the rankings alter more when applying the GenoWAP method. Our method intrinsically determines the relevance of the auxiliary data and will reweight the \(p\)-values accordingly. In the GenoWAP example, the correlation between the \(p\)-values and the GenoCanyon scores (\(q\)) is very low (0.017) indicating that deriving a score that correlates with all GWAS results is a very difficult problem and that most “scores” of this type only capture a small proportion of the variability. This highlights the need for iterative methods that can leverage several layers of auxiliary data in a statistically robust framework.

Note: In FINDOR “To ensure a fair comparison to our approach, we ranked SNPs based on the resulting posterior probability and counted the number of independent GWAS loci in the top K SNPs, where K was the total number of genome-wide significant SNPs reported by FINDOR for the corresponding functional criteria.”

Also in GenoWAP: “We ranked the 298 321 SNPs in the NIDDK study based on their P-values and posterior scores, respectively. Then, within each of the 71 loci, we compared the rank of the lowest P-value to the rank of the largest posterior score. 56 out of 71 loci (79%) had an improved rank, 3 loci (4%) had an equal rank, while only 12 loci (17%) had a reduced rank (Supplementary Table S3). The probability of having an increased rank is significantly higher than that of having a decreased rank (P-value¼ 3:11 108, one-sided binomial test).” (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5963360/pdf/btv610.pdf)

Think I need to compare at the loci level.


Comments and queries


  • Perhaps need to look at both simulation results and Asthma results in terms of loci rather than significant SNPs (especially if comparing to GenoWAP)? How should I define my regions? E.g. FINDOR:

“The primary metric of interest in both real and simulated data was the number of independent GWAS loci (at a level of p < 5 × 10−8) that the various methodologies identified. We conservatively define independent loci using PLINK’s LD-clumping algorithm with a 5 Mb window and an r2 threshold of 0.01.”

Useful links: https://privefl.github.io/bigsnpr/articles/pruning-vs-clumping.html; https://www.cog-genomics.org/plink/1.9/postproc#clump; https://www.cog-genomics.org/plink/1.9/formats#assoc; https://www.cog-genomics.org/plink/1.9/formats#clumped

  • Need to think about a name for the method. Extention to continuous covariates could be called “continuous cFDR” (maybe ccFDR?) and then the application to using functional data with GWAS p-values could be “functional cFDR”?

  • R package: I’m thinking of calling it ccFDR with only a couple of “all-in-one” functions (rather than James’ that has different functions for L curve generation and integration):

#' cFDR for continuous covariates
#'
#' @title ccFDR
#' @param p p values for principal trait (vector of length n)
#' @param q auxillary functional data values (vector of length n)
#' @param indep_index indices of independent SNPs
#' @param nxbin number of bins in zp direction for hex-binning
#' @param res_p number of test points in x-direction (p)
#' @param res_q number of test points in y-direction (q)
#' @param gridp number of data points required in a KDE grid point for left-censoring
#' @param splinecorr logical value for whether spline correction should be implemented
#'
#' @return dataframe of p-values, q-values and v-values
#' @export
ccFDR <- function(p, q, indep_index, nxbin = 1000, res_p = 300, res_q = 500, gridp = 50, splinecorr = TRUE){ ... }